对抗性的鲁棒性已成为机器学习越来越兴趣的话题,因为观察到神经网络往往会变得脆弱。我们提出了对逆转防御的信息几何表述,并引入Fire,这是一种针对分类跨透明镜损失的新的Fisher-Rao正则化,这基于对应于自然和受扰动输入特征的软磁输出之间的测量距离。基于SoftMax分布类的信息几何特性,我们为二进制和多类案例提供了Fisher-Rao距离(FRD)的明确表征,并绘制了一些有趣的属性以及与标准正则化指标的连接。此外,对于一个简单的线性和高斯模型,我们表明,在精度 - 舒适性区域中的所有帕累托最佳点都可以通过火力达到,而其他最先进的方法则可以通过火灾。从经验上讲,我们评估了经过标准数据集拟议损失的各种分类器的性能,在清洁和健壮的表现方面同时提高了1 \%的改进,同时将培训时间降低了20 \%,而不是表现最好的方法。
translated by 谷歌翻译
由于共享数据也可能揭示敏感信息,因此数据收集爆炸提高了用户的严重隐私问题。隐私保留机制的主要目标是防止恶意第三方推断敏感信息,同时保持共享数据有用。在本文中,我们在时间序列数据和智能仪表(SMS)功耗测量的背景下研究了这个问题。虽然私有和释放变量之间的互信息(MI)已被用作常见的信息理论隐私度测量,但它无法捕获功耗时间序列数据中存在的因果时间依赖性。为了克服这种限制,我们将定向信息(DI)介绍在所考虑的环境中的一种更有意义的隐私措施,并提出了一种新的损失功能。然后使用对抗的侵犯框架进行优化,其中两个经常性神经网络(RNN),称为释放器和对手,受到相反的目标训练。我们对攻击者可以访问释放器使用的所有培训数据集的最坏情况下,从SMS测量中的实证研究从SMS测量,验证所提出的方法并显示隐私和实用程序之间的现有权衡。
translated by 谷歌翻译
智能仪表(SMS)能够几乎实时与实用程序提供者的功耗。这些细粒度的信号携带有关用户的敏感信息,从隐私观点提出了严重问题。在本文中,我们专注于实时隐私威胁,即尝试以在线方式从短信数据推断敏感信息的潜在攻击者。我们采用信息理论隐私措施,并表明它有效地限制了任何攻击者的表现。然后,我们提出了一种普遍的制定来设计一种私有化机制,可以通过向SMS测量增加最小的失真量来提供目标水平。另一方面,为了应对不同的应用,考虑灵活的失真度量。该配方导致一般损失函数,其使用深入学习的对抗性框架进行了优化,其中两个神经网络 - 被称为释放器和对手 - 受到相反的目标训练。然后执行详尽的经验研究以验证所提出的方法的性能,并将其与最先进的方法进行比较,以便占用检测隐私问题。最后,我们还研究了释放者和攻击者之间数据不匹配的影响。
translated by 谷歌翻译
Mixtures of von Mises-Fisher distributions can be used to cluster data on the unit hypersphere. This is particularly adapted for high-dimensional directional data such as texts. We propose in this article to estimate a von Mises mixture using a l 1 penalized likelihood. This leads to sparse prototypes that improve clustering interpretability. We introduce an expectation-maximisation (EM) algorithm for this estimation and explore the trade-off between the sparsity term and the likelihood one with a path following algorithm. The model's behaviour is studied on simulated data and, we show the advantages of the approach on real data benchmark. We also introduce a new data set on financial reports and exhibit the benefits of our method for exploratory analysis.
translated by 谷歌翻译
G-Enum histograms are a new fast and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length principle (MDL) to derive two different model selection criteria. Several proven theoretical results about these criteria give insights about their asymptotic behavior and are used to speed up their optimisation. These insights, combined to a greedy search heuristic, are used to construct histograms in linearithmic time rather than the polynomial time incurred by previous works. The capabilities of the proposed MDL density estimation method are illustrated with reference to other fully automated methods in the literature, both on synthetic and large real-world data sets.
translated by 谷歌翻译
This paper presents an introduction to the state-of-the-art in anomaly and change-point detection. On the one hand, the main concepts needed to understand the vast scientific literature on those subjects are introduced. On the other, a selection of important surveys and books, as well as two selected active research topics in the field, are presented.
translated by 谷歌翻译
Machine learning (ML) models are nowadays used in complex applications in various domains, such as medicine, bioinformatics, and other sciences. Due to their black box nature, however, it may sometimes be hard to understand and trust the results they provide. This has increased the demand for reliable visualization tools related to enhancing trust in ML models, which has become a prominent topic of research in the visualization community over the past decades. To provide an overview and present the frontiers of current research on the topic, we present a State-of-the-Art Report (STAR) on enhancing trust in ML models with the use of interactive visualization. We define and describe the background of the topic, introduce a categorization for visualization techniques that aim to accomplish this goal, and discuss insights and opportunities for future research directions. Among our contributions is a categorization of trust against different facets of interactive ML, expanded and improved from previous research. Our results are investigated from different analytical perspectives: (a) providing a statistical overview, (b) summarizing key findings, (c) performing topic analyses, and (d) exploring the data sets used in the individual papers, all with the support of an interactive web-based survey browser. We intend this survey to be beneficial for visualization researchers whose interests involve making ML models more trustworthy, as well as researchers and practitioners from other disciplines in their search for effective visualization techniques suitable for solving their tasks with confidence and conveying meaning to their data.
translated by 谷歌翻译
In recent years the applications of machine learning models have increased rapidly, due to the large amount of available data and technological progress.While some domains like web analysis can benefit from this with only minor restrictions, other fields like in medicine with patient data are strongerregulated. In particular \emph{data privacy} plays an important role as recently highlighted by the trustworthy AI initiative of the EU or general privacy regulations in legislation. Another major challenge is, that the required training \emph{data is} often \emph{distributed} in terms of features or samples and unavailable for classicalbatch learning approaches. In 2016 Google came up with a framework, called \emph{Federated Learning} to solve both of these problems. We provide a brief overview on existing Methods and Applications in the field of vertical and horizontal \emph{Federated Learning}, as well as \emph{Fderated Transfer Learning}.
translated by 谷歌翻译
Co-clustering is a class of unsupervised data analysis techniques that extract the existing underlying dependency structure between the instances and variables of a data table as homogeneous blocks. Most of those techniques are limited to variables of the same type. In this paper, we propose a mixed data co-clustering method based on a two-step methodology. In the first step, all the variables are binarized according to a number of bins chosen by the analyst, by equal frequency discretization in the numerical case, or keeping the most frequent values in the categorical case. The second step applies a co-clustering to the instances and the binary variables, leading to groups of instances and groups of variable parts. We apply this methodology on several data sets and compare with the results of a Multiple Correspondence Analysis applied to the same data.
translated by 谷歌翻译
Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their capacity to extract such structures in continuous, binary or contingency tables. However, very little work has been done to perform co-clustering on mixed type data. In this article, we extend the latent block models based co-clustering to the case of mixed data (continuous and binary variables). We then evaluate the effectiveness of the proposed approach on simulated data and we discuss its advantages and potential limits.
translated by 谷歌翻译